Groover Technical test

This notebook presents the results obtained for the Groover data science internship test by Julien Guinot. Part 1 is a summary of a research paper, which can be found as a PDF file in the directory alongside this notebook.

Part 2

In part 2, we are provided with 19 raw WAV files containing samples of music from the Groover library. All samples are roughly 30 s long, though not exactly of the same size. For image-based feature extraction such as spectrograms, we will have to pad the images to obtain fixed-length inputs. Our goal is to extract audio features to be used later for a given machine learning task.
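As a minimal sketch of the padding step described above (the shapes here are illustrative, e.g. a 128-bin mel spectrogram, and `pad_spectrogram` is a hypothetical helper, not part of the actual pipeline):

```python
import numpy as np

def pad_spectrogram(spec, target_frames):
    """Right-pad (or truncate) a spectrogram along the time axis
    so every clip yields the same fixed-length input shape."""
    n_bins, n_frames = spec.shape
    if n_frames >= target_frames:
        return spec[:, :target_frames]
    pad_width = target_frames - n_frames
    # Zero-pad only on the right of the time axis
    return np.pad(spec, ((0, 0), (0, pad_width)), mode="constant")

spec = np.random.rand(128, 1290)            # stand-in for a mel spectrogram
padded = pad_spectrogram(spec, 1300)        # shape becomes (128, 1300)
```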

That being said, we are looking to extract audio features that are crucial for classification or recommendation tasks. Many common audio features exist, including but not limited to:

We will build a dataset comprising both song-level and beat-level audio features. This means we will compute the features for each song (30 s clip), attribute a tempo to the song, split the signal according to this estimated tempo, and re-compute the features for each beat extracted from the original signal. This provides a global feature value for each clip while also allowing us to look at finer detail, which will be useful for tasks such as comparing chorus loudness against verse loudness.

All of this can be achieved using the Librosa library, which is particularly well suited to audio feature extraction tasks.

Part 3

For part 3, we are provided with a pre-constructed dataset of audio features for 1000 fixed-length songs. The goal here is to build a genre classification model based on the provided features, with respect to the single-label ground-truth column in the dataset, which will be explored later. To do this, we will:

Basic imports

Part 2

As previously stated, the goal of this section is to extract audio features from the 20 audio files provided as raw data. Many audio features can be extracted and are useful for audio tasks, but we focus on a few criteria:

This being said, we choose to focus on the following features:

High-level

Mid-level

Low-level

The code containing all of the processing steps can be found in src/Audio_proc_utils.py at the following GitHub repo:

https://github.com/Pliploop/Groover_Tech_Test

Our data was successfully extracted through the pipeline built above. It would have been worthwhile to conduct exploratory data analysis on the results; unfortunately, with 20 samples and the allotted time frame, this was not possible over the course of this study.

To-do

Part 3

The dataset has no missing values, which suggests that it does not need further cleaning.

Furthermore, the classes are already balanced, which greatly facilitates the classification task, but does not truly represent the data we would get from real-life data scraping and the genre distribution in popular music.


Though this is not the main goal of this section, some remarks can be made about the previous visualizations:

Data pre-processing

Min-max scaling, one-hot encoding and label encoding

We use min-max scaling to constrain all features to the same (0, 1) range. This mitigates the problems linked to large feature values, such as exploding weights. Furthermore, we use label encoding for the ground truth of single-output models (SVM, K-means), and one-hot encoding for multi-output models (SVM, MLP).

Train-test-validation split

We use an 80-10-10 train-test-validation split due to the low amount of data we have (1000 samples). We will also validate our optimal model through K-fold cross-validation later on.
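One way to obtain the 80-10-10 split is two chained calls to `train_test_split` (the feature matrix here is random, just to show the shapes):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 26)                   # placeholder feature matrix
y = np.random.randint(0, 10, size=1000)        # placeholder genre labels

# First carve off 20%, then split that half-and-half into test and validation
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42)
```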

Metrics

Multiclass classification metrics

The usual multiclass classification metrics are the following:

In our case, recall is the sub-metric that interests us the most, as it represents how often the model predicts a class correctly when presented with a positive sample of that class. Essentially, our use case leads us to consider the optimal model to be one that does not overshoot but retrieves the correct genre often when presented with each genre, as opposed to precision, which minimises false predictions.

We use macro-averaged metrics: since there is no class imbalance, the arithmetic mean of each per-class metric does not need to account for class imbalance, which is what micro-averaging is for.
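These macro-averaged metrics come straight out of sklearn; a small worked example (with made-up labels) shows how macro recall is the unweighted mean of per-class recalls:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

acc  = accuracy_score(y_true, y_pred)                      # 4/6
prec = precision_score(y_true, y_pred, average="macro")
# Per-class recalls: class 0 -> 1/2, class 1 -> 2/2, class 2 -> 1/2
rec  = recall_score(y_true, y_pred, average="macro")       # (0.5 + 1 + 0.5) / 3
f1   = f1_score(y_true, y_pred, average="macro")
```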

Single-class one vs all metrics

Once our optimal model is established, it is interesting to consider its performance on each class individually by framing a binary, one-vs-all classification problem. This is useful for identifying classes where classification does not work optimally, so that we can devise preprocessing strategies for better results.
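In practice, `classification_report` gives exactly these per-class (one-vs-all) precision, recall and F1 rows; the labels below are toy data for illustration:

```python
from sklearn.metrics import classification_report

y_true = ["rock", "jazz", "rock", "blues", "jazz", "blues"]
y_pred = ["rock", "jazz", "blues", "blues", "rock", "blues"]

# One row of precision / recall / F1 per class, plus the macro averages
print(classification_report(y_true, y_pred))
```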

Testing some out-of-the-box models

In this section, we implement some basic classification algorithms from sklearn, out of the box. We use the provided default values, or values obtained by quick trial and error, to establish a baseline for all models, and use k-fold validation to compute the average, maximum, and minimum accuracy for each model.
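A baseline sweep along these lines might look as follows; the synthetic dataset stands in for the provided features, and the model selection mirrors (but does not reproduce exactly) the models tried in this study:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Stand-in for the 1000-song feature dataset
X, y = make_classification(n_samples=300, n_features=26, n_informative=10,
                           n_classes=5, random_state=0)

models = {
    "svm": SVC(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boost": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "mlp": MLPClassifier(max_iter=300, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)   # k-fold accuracy per model
    print(f"{name}: mean={scores.mean():.3f} "
          f"min={scores.min():.3f} max={scores.max():.3f}")
```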

This will allow us to isolate our top 3 models, which we will then fine-tune to reach the best accuracy values.

Three models come out on top: gradient-boosted algorithms, with 83% accuracy on average; the MLP, with 77% on average; and random forest, with 72% on average. We will move forward with these models to decide which one to choose as our final model.

10-fold cross-validation reveals our top 3 optimal models. We rely not only on pure accuracy, but also on recall (shown previously to be our metric of preference for this use case), explainability, and customization capabilities. Our attention lands on these three models, which will now be submitted to a hyperparameter grid search to determine optimal hyperparameter values on the test set:

All of these models scored in the top 3 in terms of accuracy with our test-data validation. We establish the following grid-search dictionaries for our selected models.
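The grid-search dictionaries plug into `GridSearchCV`; a minimal sketch for one of the three models (the parameter grid here is a small illustrative example, not the full grid used in the study), scored on macro recall as motivated above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stand-in for the real feature dataset
X, y = make_classification(n_samples=300, n_features=26, n_informative=10,
                           n_classes=5, random_state=0)

param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 10],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring="recall_macro")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```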

From a macro-averaged metric standpoint, the best gradient-boosted algorithm beats both of the others by healthy margins of 4 and 9 percent on accuracy. We will now look at single-class classification metrics to determine whether our best model has any particular weaknesses on certain classes.

There is no true observable discrepancy in class metrics across the three models apart from three classes: Rock, Blues, and Country. This is not so surprising given the similar natures of these music genres. On the other hand, all models perform well on classical, jazz, and disco, which, as we saw before, all present distinct feature differences compared to the other classes. If we now visualize confusion matrices on the test set for each model, this is what we get:

We can see that, thankfully, not many samples fall outside the diagonal. When they do, however, some are coherent misclassifications that even a human might make:

Overall, there is no real trend for any given class to be misclassified, apart from rock-blues-country samples, which, given the similarity of the styles, is coherent. Additional pre-processing should be set up for these genres to further distinguish them.
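The confusion matrices themselves are one sklearn call; a toy example with made-up genre predictions illustrating the rock/blues/country confusion discussed above:

```python
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

y_true = ["rock", "blues", "country", "rock", "jazz", "blues"]
y_pred = ["blues", "blues", "rock", "rock", "jazz", "country"]

labels = ["blues", "country", "jazz", "rock"]
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)  # rows = true class, columns = predicted class
# ConfusionMatrixDisplay(cm, display_labels=labels).plot()  # heatmap version
```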

Explainability and feature importance

Contrary to the MLP, explainability for the gradient-boosted algorithm and random forest is rather easy to obtain: these are not black-box models in which the decision process is left for the user to guess. Below, we visualize a single decision tree out of the 500 in the optimal model, for both random forest and gradient boosting:

The goal here is not to visualize the best tree, but rather to show the potential for explainability that random forest and gradient-boosted algorithms offer compared to the somewhat black-box MLP model.
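For instance, any single tree of a fitted forest can be dumped as text, and the forest exposes aggregate feature importances; the small synthetic dataset and feature names below are placeholders for the real audio features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# Placeholder for the audio-feature dataset
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
forest = RandomForestClassifier(n_estimators=50, max_depth=3,
                                random_state=0).fit(X, y)

# Inspect one of the 50 trees as plain text (plot_tree gives a graphic instead)
feature_names = [f"feat_{i}" for i in range(6)]
print(export_text(forest.estimators_[0], feature_names=feature_names))

# Importances aggregated over the whole forest (sum to 1)
print(forest.feature_importances_)
```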

Final Model

So, based on K-fold-verified accuracy and recall, and given the model's high explainability, our model of choice is the best gradient-boosted algorithm found by the grid search, reaching 84% accuracy on our test set. It goes without saying that more complex models could be trained outside the context of this technical test: for instance, a custom MLP with dropout layers and further grid search could probably reach about 90% accuracy, as shown in this paper:

https://biblio.ugent.be/publication/5973853

But with the resources available for this one-week study, the obtained accuracy is reasonable with regard to state-of-the-art models. As previewed in part 2, further analysis could be done with harmonic-to-percussive ratios, keys, or even computer vision algorithms applied to the generated spectrograms. It would be interesting to explore these directions with a larger, real-world dataset in the future.

Thank you for your time! If you have any questions, please email me.

To-do